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Abstract 

Background: The comparison of relative gene orders between two genomes offers deep insights into functional 
correlations of genes and the evolutionary relationships between the corresponding organisms. Methods for gene 
order analyses often require prior l<nowledge of homologies between all genes of the genomic dataset. Since such 
information is hard to obtain, it is common to predict homologous groups based on sequence similarity. These 
hypothetical groups of homologous genes are called gene families. 

Results: This manuscript promotes a new branch of gene order studies in which prior assignment of gene families 
is not required. As a case study, we present a new similarity measure between pairs of genomes that is related to 
the breakpoint distance. We propose an exact and a heuristic algorithm for its computation. We evaluate our 
methods on a dataset comprising 12 y-proteobacteria from the literature. 

Conclusions: In evaluating our algorithms, we show that the exact algorithm is suitable for computations on small 
genomes. Moreover, the results of our heuristic are close to those of the exact algorithm. In general, we 
demonstrate that gene order studies can be improved by direct, gene family assignment-free comparisons. 



Background 

In the field of comparative genomics, studying the rela- 
tive order of genes in genomes is a popular practice to 
gain information about organisms and their relation- 
ships. This information ranges from transcription and 
functional linkage of genes such as correlated expres- 
sion, the phylogeny of organisms, to detailed evolution- 
ary dynamics of their genomes. Gene order methods are 
also incorporated in genome alignment strategies to 
identify regions that are subsequently used to anchor 
the alignment [1]. 

Genes are the atomic elements in gene order studies. 
Although no precise, formal definition is generally agreed 
upon, from the biological point of view a gene represents 
a specific inheritable entity in a particular locus on a 
chromosomal sequence in a particular organism. It often 
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features a protein coding region. Nevertheless, the notion 
of a "gene" can also represent more fine-grained genetic 
structures such as protein domains or other functional 
elements of the genome. 

Gene families. Many gene order studies hope for evo- 
lutionary relationships being resolved between all pairs of 
genes. Rested upon the biological concept of homology, 
such studies require information about orthology, paral- 
ogy and (potentially) xenology for each pair of genes in 
the dataset. This information is generally not given, 
hence it is common to cluster genes according to their 
sequence similarity. Sometimes such groups are called 
gene families, thus we will stick to this notion in the 
following. 

Various databases exist, such as COG [2], eggNOG [3], 
Inparanoid [4], TreeFam [5], and OrthologID [6] (only to 
name a few) that offer gene family information. These 
databases can be divided into two groups: databases that 
primarily use sequence similarity to cluster genes into 
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groups of co-orthologs; and tree-based databases offering 
reconstructed gene family trees [7]. 

The former group of databases provides usually more 
gene family data while covering a larger set of species. 
However, the contained information should always be 
taken with a pinch of salt: Even though high sequence 
similarity is a good indicator of homology, per se these 
gene families do not reflect an evolutionary relation. 
This is because they depend on arbitrary parameters of 
sequence comparison, similarity quantification, and clus- 
tering. Generally such parameters are user-controlled 
and influence the size and granularity of the computed 
gene families. Yet, the vast majority of these databases is 
uncurated or offers only a negligible amount of curated 
data. 

Lacking a gene tree, within these gene families no differ- 
entiation can be made between in- and out-paralogs when 
comparing a specific pair of genomes. As is well-known, 
gene duplication and sub- or neofunctionalization occurs 
frequently in evolution. Hence the number of co-ortholo- 
gous genes in a genome that are pooled into the same 
gene family grows the higher one ascends in the evolution- 
ary tree. With increasing number of diverse genomes in 
the database, these gene families become less useful for 
gene order analyses, if only a close subset of taxa is of 
interest. The blemish of disregarding the evolutionary tree 
needed for truly resolving evolutionary relationships 
between genes of a given set of genomes is often covered 
by offering varying levels of granularity. This means that 
for some subtrees (but generally not for all) of the gen- 
omes in the database, gene families are recomputed with 
tighter parameters. Moreover, the computed sequence- 
based similarity estimates are rarely based on models of 
DNA evolution as these involve considerably more com- 
putational load. Subsequently differential evolutionary 
rates are disregarded, amplifying the dilemma of grouping 
genes based on sequence similarity: selecting too loose cri- 
teria in clustering genes to gene families may lead to the 
mistake that two genes are assigned to the same gene 
family while they are not homologous, whereas too strict 
criteria can split gene families although they should belong 
together [7]. 

Tree-based databases such as TreeFam and OrthologID 
may provide more accurate information desired for gene 
order studies. This is partly because the evolutionary rela- 
tionships between genes in a gene family are considerd in 
more detail. Furthermore the species tree is taken into 
account while reconstructing the gene family trees. Also, 
tree-based databases tend to be more often manually 
curated than their sequence similarity based counterparts. 
In return, the provided gene family information is often 
sparse and covers not all genes of a genome. Moreover, 
such databases usually comprise only a handful of species. 
As a result, they are of limited use in gene order studies. 



Gene content variations. Apart from model-free 
comparison or well-defined rearrangements in genomes, 
gene order studies can allow for additional biologically 
motivated operations of evolution. That is, genes can 
duplicate, emerge or become lost in the genome. Simi- 
larly, a gene family can grow or shrink, or new gene 
families can arise. 

Gene order studies. Based on the concept of gene 
families, many gene order studies share a common data 
structure where chromosomes are represented as words 
drawn from a finite alphabet of gene families. The strength 
of this data structure lies in its simplicity; it allows to study 
the corresponding gene order problems in an abstract 
form composed of permutations or sequences over a set 
of characters. Another important advantage is the fact that 
homology is a binary and transitive relation. This led to 
the emergence of a multitude of efficient algorithms which 
solve gene order problems combinatorially. 

In the following we will briefly review three different 
types of gene order studies. Dissimilarity measures such as 
the breakpoint distance [8] are used to calculate evolution- 
ary distances between two or more genomes, without 
explicitly drawing on rearrangement operations. The 
breakpoint distance is defined by the number of uncon- 
served adjacencies between characters of two genomes. 
For gene cluster detection, several competing models 
exist. One of them is based on the notion of approximate 
common intervals [9]. Thereby a gene cluster is defined as 
a set of maximal intervals, on two or more genomes, that 
share the same character set. Small differences between 
the set of characters constituting the gene cluster and the 
set of characters within the intervals are allowed. The 
number of tolerated differences as well as the minimal size 
of an interval is determined by a user-controlled para- 
meter. Finally, a group of popular rearrangement models 
are based on the so-called double-cut-and-join (DCJ) 
operation [10,11]. By disrupting the genome on two differ- 
ent positions and rejoining the resulting ends, one aims to 
transform one genome into another by a minimal 
sequence of DCJ operations. This sequence is denoted 
sorting scenario. 

Limits of the gene family concept. The concept of 
gene families comes with much benefit, but also has its 
detriments. On the one hand, gene family information can 
be gained with comparatively low effort by accessing var- 
ious public databases or by direct computation. On the 
other hand, comparative studies based on uncurated gene 
families are hampered since data can be incorrect. 

There are many reasons why the exclusive, binary mem- 
bership relation between genes and gene families is dispu- 
table in itself For one, most gene families are uncurated, 
hence it would be supporting in constitutive analyses to 
distinguish between weak and strong assumptions of 
homology between genes in supporting their membership 
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to one or more gene families. Moreoever, the gene family 
concept disregards the facts that gene families may share 
conserved protein domains and that genes may fuse with 
others in the course of evolution. 

In this paper we promote the idea that gene order stu- 
dies can be performed without prior gene family assign- 
ment. We propose direct use of similarity values because 
such information not only allows to make more substan- 
tiated choices in resolving gene order in subsequent ana- 
lyses, but can sometimes better reflect the biological 
reality. In support of our case, we present a new approach 
to calculate the number of conserved adjacencies, which is 
a similarity measure related to the breakpoint distance, 
without the use of gene families. Our method is based on 
a weighted bipartite graph, representing pairwise similari- 
ties between genes of two genomes. We show that this 
allows for stable adjacency analyses when similarities are 
calculated based on sequence similarity. 

In the "Methods" section we will introduce the problem 
setting formally and devise an exact algorithm as well as a 
heuristic for its solution. In the "Experiments and Discus- 
sion" section we discuss the performance of our presented 
method on this dataset and compare results with former 
work. The manuscript closes with concluding remarks and 
future prospects in the "Conclusions" section. 

Methods 

Formal problem description 

Genome model. Let Q be the universe of all genes, then 
a chromosome is defined as a sequence of genes (°, gi, 
g2< ■■■< gn-\°)^ with gi e Q for all / = 1, « - 1, flanked 
by telomeric ends represented by Depending on the 
type of gene order study, chromosomes can be signed 
or unsigned. If signed, a gene g has a direction indicated 
by -g or +g (but it is common to omit the "+"), which 
represents the relative orientation of each gene along 
the chromosome. A chromosome can also be circular as 
it is often observed in bacteria; in this case, it does not 
exhibit telomeric ends, implying that the outermost 
genes adjoin. For the time being, let us assume that a 
genome is unichromosomal and linear, since the general 
case of our model can be easily inferred. The size of a 
genome G with « - 1 genes is |G| = «. In order to refer 
to the /th gene of G, we use the notation G{i]. Further, 
let cr : C/ X C/ —>• [0, 1] be a normalized similarity mea- 
sure between all pairs of genes. 

Graph representation. Given two genomes Gi, Gn of 
lengths «i = I Gil and «2 = IG2I, respectively. We define an 
ordered weighted bipartite graph B = (Gi, G2, E) over both 
genomes in which the order is given by the chromosomal 
order of genes (see example in Figure 1(a)). For 0 < / < «i, 
0 < < «2> a pair of genes, one from Gi and one from G2, 
is connected by an edge e,-^ := {Gi[i], G2[/c]) e£ with 
edge weight w{eii^):= a(Gi[j], G2M) if and only if a(Gi[i], 



G2[k\) > 0. Telomeres are always connected with edges of 
weight 1: w{eoo) = w[eonJ = w(enio) = ^{en^nj) = 1. as 
depicted in our example in Figure 1(a). We call a gene g, 
e Gx, X e {1, 2} unconnected if there exists no gene g^ e 
Gy, y e {{l,2}\{x}) such that o{gi,gk) > 0. 

Unconnected genes are omitted from the chromoso- 
mal sequences. The remaining genes form connected 
components of size two or larger. Let C denote the set 
of all such connected components of B, then for some 
C e C and x e {1,2}, denotes the set of all genes of C 
that are part of G^. Given B, we will be interested in 
finding a set of disjoint edges. Such a set, denoted by 
M, is known as matching. 

Matchings. Let us assume for now that a matching M 
between Gi and G2 is given. #edg (A^) denotes the 
number of edges in J\/[- We call a gene saturated if it is 
incident to an edge of the matching. A pair of genes {G^ 
[i], Gx\i]), with X e {1,2} and 0 < i < J < n„ is a consecu- 
tive pair if no saturated gene lies between them. 

Recall that genes have directions; the orientation of a 
gene g is determined by the following function: 



1 ifg > 0 
-1 \ig < 0 
0 if^ is a telomere 



Two consecutive pairs of genes (Gi[i], Gi[/]) and (G2 
[k], G2[e]), with 0 < i < 7 < Ki and 0 < k, £ < n2, form a 
conserved adjacency if the corresponding edges e,-^, Cji 
are part of J\4 and: 

1. for k < e, sgn{GM) = sgn{G2[k]) and sgn{Gi[j]) = 
sgn{G2[e]) or 

2. for k > £, sgn{Gi[i]) * sgn[G2[k]) and sgn{Gi\j]) ^ 
sgn{G2[e]). 

For example, in Figure 1(b) the consecutive gene pairs 
(2, -3) and (6, -7) represent a conserved adjacency. Telo- 
meres located at the first and last position of the chromo- 
somes are "unsigned" and thus can be used to form 
adjacencies in both directions. We denote the sum of all 
conserved adjacencies in a matching M. by #adj(Al). 

Among all possible matchings between Gi and G2, we 
search the biologically most relevant. A well-known 
matching is the maximal weighted matching, which maxi- 
mizes the sum of weights of disjoint edges of a bipartite 
graph. In our example. Figure 1(b) represents a maximal 
weighted matching. This kind of matching can be moti- 
vated from a biological point of view: The higher the 
sequence similarity between two genes, the more likely 
they are homologs. Yet, if we want to construct a biologi- 
cally meaningful matching, we must not only consider 
edge weights, but also the ability of two edges forming a 
conserved adjacency in the final matching. We somehow 
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Gjo 5 6-78 -9 ° G2''5 6-7 -9° Gjo 

(a) (b) 

Figure 1 Bipartite graph and matchilng. (a) The ordered weighted bipartite graph represents similarity relationships between genes of two 
genomes, (b) and (c) represent solutions to Problem 1 with a — > 0 and a = 1 respectively. 



want to maximize for the number of conserved adjacen- 
cies in the final matching, because we observe from biolo- 
gical data that rearrangements of genes in genomes occur 
parsimoniously. However, we want to prevent that con- 
served adjacencies incorporating low-weight edge pairs are 
formed if the corresponding genes are incident to higher- 
weight edges (see Figure 1(c)). Consequently we propose 
the following scoring scheme for conserved adjacencies: 



■ w[ej]) if[C.'i|i], Ci[;]) and (G'2[fe], C'll^]) form a conserved adjacency 
otherwise 



(1) 



In our matching we want to promote conserved adja- 
cencies but also edges: Because in the presented 
approach, connected components are larger than gene 
families, we aim to match more than one pair per con- 
nected component, even in the case they do not exhibit 
adjacencies. Hence we quantify the quality of a matching 
J\/i according to the following functions, where i, j indi- 
cate indices in genome Gi; k, t in G^. 



ad)iM)= ^ s[i,i,k,l) 



0<i<;<\Gi\, 
0<fci<|G2| 



:(M)= ^ 



(2) 



(3) 



Notice that the edge weights in the sum of the 
Equation 3 are squared to match the dimension of 
Equation 2. Optimizing a matching with respect to 
edg{M) will result in a maximal weighted matching 
in the graph model we introduced above. As our over- 
all objective function we propose a linear combination 
between Equations 2 and 3. We allow the user to bal- 
ance between those two quantities by a parameter a. 
Moreover it is reasonable to add the constraint that at 
least one edge per connected component of the bipar- 
tite graph between Gi and G2 must be contained in 
the matching; The matching obtained is an intermedi- 
ate matching. 

Problem 1 (Family-free(FF)-Adjacencies) Given two 
genomes Gi and G2, a normalized similarity measure a, 



and some aL ]0, I], find a matching J\Ain B = (Gi, G^iE) 
such that at least one edge per connected component ofB 
is contained in Mand the following formula is maxi- 
mized: 



Ta (M) = a ■ ad] (M) -1- (1 - a) • edg (M) . 



(4) 



Problem FF-Adjacencies can be reduced to two pro- 
blems that were addressed already by Tang and Moret 
[12] and Angibaud et al. [13]. Therefore, let us consider 
equivalent conditions that prevail if gene families are 
given: In the bipartite graph B = (Gi, G2,E) between two 
genomes Gi and G2 all edges have edge weight 1 and all 
connected components are cliques. Then finding a solu- 
tion to Problem FF-Adjacencies with a = 1 is equivalent 
to finding a matching that maximizes the number of adja- 
cencies between two genomes with duplicate genes under 
the intermediate model [13]. If a comes close enough to 0, 
we will obtain a maximum matching, yet maximizing the 
number of adjacencies [12]. The case where family condi- 
tions are met also reveals the difference between an arbi- 
trary maximum matching and the maximum matching 
found by solving Problem FF-Adjacencies for a ^ 0. 

The reduced problems presented above being already 
NP-hard, the problem FF-Adjacencies is NP-hard as well. 
In the next two subsections we propose first an exact algo- 
rithm, FFAdj-Int, to solve Problem FF-Adjacencies and 
then a fast heuristic approach. 

Exact algorithm 

Our algorithm FFAdj-Int solving Problem FF-Adjacen- 
cies is based on previous work in [13]. The idea is to 
translate the problem into a 0-1 linear program. That 
means we define a set of constraints (linear inequations) 
whose variables are booleans and an objective function 
(maximization or minimization of a linear formula). 
Then, we use a solver to assign a value for each variable 
such that the constraints are verified and the objective is 
optimized. 

The program FFAdj-Int considers two linear genomes 
Gi and G2 of respective lengths «i and «2) a number 
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a e ]0, 1], and a function o : Q x Q ^ [0,1]- The objec- 
tive, the variables and the constraints are defined in Fig- 
ure 2 and are briefly discussed hereafter. 
Variables: 

• Variables a{i, k), 0 < i < and 0 < ^ < «2> define a 
matching M'- a-i^k = 1 if and only if the gene at posi- 
tion i in Gi is matched with the gene at position k 
in Gi in J\A, ^ik e M- 



♦ Variables hjyi), x e {1, 2} and 0 < / < n„ represent 
the genes saturated by J\A'- hJJ) = 1 if and only if 
the gene at position i in Gx is saturated by the 
matching M- Clearly, So < i < «i ^i(i) = < k < ni 
b2(k), and this is precisely the size of the matching 
M. 

♦ Variables {i, J), x e {1, 2} and 0 < i < J < n^, 
represent consecutive pairs according to the match- 
ing M.: Cx(i, i) = 1 if and only if the genes at 



Program FFAdj-Int 
Objective : 

Maximize Q- E E s{i,3,k,i) ■ d{i,j,kj) + {\ - a) ■ E (T{Gi[i],G2[k\f ■ a{i,k) 

0<'<j<nj 0<k.l<7i2 0<i<ni 

n<k<n2 

Constraints : 

(C.Ol) V 0 < ?: < m, E a(i,k)=hi{i) 

V<k<n2 

V0<fc<n2, E a{i,k) = h2{k) 

•)<i<'M 

(C.02) V:re {1,2}, VCgC, E 6.r(0 > 1 

{C. 03) V X e {1,2},V 0<i < j <n^, c4i,j)+ E bxip)>l 

i<p<j 

(C.04) V X G {1, 2}, V 0 < / <p<j< 71^, c^ij) + t,(p) < 1 

(C.05) V 0 < i < i < n,, V 0 < A:, £ < n2, V A: 7^ f , a(Gi\i],G2[k]) > 0, a{Gx[j],G2[l]) > 0 
such that {k < £ and sgn{G\[i]) = sgn{G2[k]) and sgn{G\[j]) = sgn{G2[i])), 

or {k>e and sgn{Gi[{\) = -sgn{G2[k]) and sgniGi[j]) = -sgn{G2[e])), 
a{i, k) + a{j,e) + ci{i,j) + C2{k,£) - d(Qi,j, k,£)<3 
a{i,k) -dii,j,k,e) > 0 
a(j,e)-d{i,3,k,£) > 0 
ci{i,j) -d{i,j,k,e) > 0 
C2{k,£)-dii,j,k,£) > 0 

(C.06) V 0 < « < J < m, V 0 < A:,£ < n2 
such that k = £, 

or (T(G,[i],G2[^i) = 0, 
or a(Gj[i],G2K])=0, 

or {k<£ and {sgn{Gi[i]) ^ sgn{G2[k]) or sgn{G^[3\) ^ sgn{G2[£\))), 
or {k>£ and (s5n(Gi[i]) ?^ -sgn{G2[k]) or S3n(Gi[j]) / -S5n(G2M))), 

d{i, j, kj.) = 0 

Domains : 

Vie {1,2},V 0 < i < i < ni.V 0 < k,e< 712, A; / ^ 
a(i,A;), 6x(i), Cx(j,j), d{i, j, k, £) £ {0,1} 



Figure 2 Program FFAdj-Int. Program FFAdj-Int for finding an intermediate matching that maximizes the objective (a e ]0, 1]) between 
two genomes G, and G2. 
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positions i, j in Gx are saturated by M. and no gene 
at position p, i<p<j, is saturated by M.- 

♦ Variables d(i, j, k, e), 0 < i < j < rii, 0 < k, £ < «2. repre- 
sent conserved adjacencies according to the matching 
M- d{i, j, k,e) = \ if and only if s{i, j, k, i) > 0. 

Because the matching is possible only between similar 
genes, the variables a{i, k) and d{i, j, k, €) are not 
defined whenever a(Gi[i], = 0. Similarly, the vari- 

ables d{i, j, k, €) are not defined if o{Gi\i],G2[i]) = 0. 

Objective: 

The goal of FFAdj-Int is to find a matching M 
between the two considered genomes that maximizes 
the formula Ta i<X e ]0, 1]). Hence, the objective of 
FFAdj-Int reduces to maximizing the sum of all vari- 
ables d{i, j, k, £) multiplied by a ■ s{i, j, k, £), plus the 
sum of all variables a{i, k) multiplied by (1 -a) ■ a{i, k)^ 

Constraints: 

Assume x e {1, 2}, Q < i < j < rii and 0 < k, £ < n2- 

♦ Constraints in (C.Ol) ensure that each gene of Gi 
and of G2 is saturated at most once, i.e. bi{i) = 1 
(resp. b2(k) = 1) if and only if there exists a unique k 
(resp. i) such that a{i, k) = 1, i.e. e J\A.- 

♦ Constraints in (C.02) ensure that the matching J\A 
is an intermediate matching, we want for each com- 
ponent at least one edge in the matching J\A. For 
each component C e C, the sum of the variables hx 
(i) for i e Cx must be greater than or equal to 1. 

♦ Constraints in (C.03) and (C.04) express the defi- 
nition of consecutive pairs, thus fixing the values of 
the variables Cx. The variable Cx(i, J) {0 < i < j < 

is equal to 1 if and only if there exists no p such 
that I < p < i and bx(p) = 1. It is worth noticing that 
the constraints do not force the variables c^(/, /) to 
have exactly the values we intuitively wish according 
to the above mentioned interpretation. Here, we 
accept that Cx{i, j) = 1 even if the gene at position / 
or i is not saturated. However, this will pose no pro- 
blem in the sequel. 

♦ Constraints in (C.05) and (C.06) define variables d. 
Knowing the variables d{i, j, k, £) are defined only if 
o{i, k) > 0 and o{j, £) > 0, constraints (C.05) and 
(C.06) ensure that we have d{i, j, k, £) = 1 if and 
only if all variables a{i, k), aij, e), Ci(i, j) and C2{k, e) 
are equal to 1 and the signs and the order of Gi[/], 
Gi[/'], G2[k] and G2[£] are consistent with the defini- 
tion of conserved adjacencies. 

The program FFAdj-Int has 0((«i«2)^) constraints 
and 0((«i«2)^) variables, which could result in a time- 
consuming computation. 



So far we have used only one simple rule in order to 
reduce the space complexity: By the definition of the 
intermediate model, for all components with only two 
genes, Gi[/] and G2[k], the edge en^ is in J\4- By the con- 
straints (C.Ol) and (C.03), we already enforce that the 
variables a{i, k), bi(i) and b2(k) are equal to 1. The rule is 
based on the fact that there is no possible consecutivity 
in M between Gi[s] and Gi[t] (resp. G2IS] and G2[t]) 
such that 0 < s < / < i < «i (resp. 0<s<k<t< «2)i i-^- Ci 
(s, t) (resp. £2(5, t)) is equal to 0. The corresponding vari- 
ables d(s, t, ., .) (resp. d{., ., s, t) and d{., ., t, s)) are also 
equal to 0. 

Heuristic 

Because of the combinatorial explosion, FFAdj-Int does 
not solve Problem FF- Adjacencies for all pairs of com- 
plete, larger genomes. But, we will see in the "Experiments 
and discussion" section that FFAdj-Int allows to obtain 
enough results to evaluate our heuristic presented in this 
section. It is based on similar ideas as the heuristic IILCS 
in [13]. IILCS allows to compute the number of adjacen- 
cies between two genomes when gene families are known, 
under three models: exemplar (only one match by gene 
family), intermediate, and maximum. IILCS resolves our 
Problem FF-Adjacencies in the particular case where a = 
1 and each component represents a gene family, i.e. each 
component is a clique where the weight of each edge is 1. 

The heuristic IILCS is a greedy algorithm based on the 
notion of LCS, Longest Common Substring: Given two 
genomes Gi and G2, an Z,C5 is a longest string S such that 
5 is a (consecutive) substring in Gi and G2, up to a com- 
plete reversal (opposite sign and reverse order). The idea 
is to match, at each iteration, all the genes that are in an 
LCS. If there are several LCSs, one is chosen arbitrarily. At 
each iteration, not only we match an LCS, but we also 
remove each unmatched gene from the genome, for which 
there is no unmatched gene of same component in the 
other genome. The process (determination of LCS, match 
and deletion of genes) is iterated until a satisfying match- 
ing is obtained. Under the intermediate model, the itera- 
tion is stopped when there is at least one edge in M for 
each component. 

For the problem FF-Adjacencies,we update the IILCS 
heuristic by three modifications. The goal of the first 
change is to take into account our objective in the choice 
of common substrings. In each iteration we match the 
common substring that maximizes locally J^a i<X e ]0,1]), 
i.e. the sum of weights of adjacencies and edges. We call 
this common substring a Maximum Common Substring 
(MCS). The second modification is an improvement that 
may also be applied to the original IILCS heuristic: After 
the deletion of an unsaturated gene g^, such that there is 
no unmatched gene g2 with o(gi,g2) > 0, we attempt to 
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increase the size of each previously matched MCS by 
extending it on both extremities. The next and the last 
change is related to the model. We have two options to 
increase our objective. The first one is to stop the itera- 
tion only when we have at least one edge per component 
and when the size of the MCS of the current iteration is 
below 2. In the case of the gene family constraints, this 
criterion improves also the results of IILCS. The second 
possibility is to stop the iteration only when there is no 
more edge between unmatched genes. In comparison to 
the first possibility, we increase our objective J^a (cc e ] 
0,1]) only if a ^ 1, so not in the context of IILCS. We 
choose this second possibility because the objective is 
bigger, but it is important to understand that then we 
also increase the number of breakpoints. We call this 
heuristic FFAdj-MCS. 

Experiments and Discussion 

Data 

To evaluate our algorithms, we chose 12 y-proteobacteria 
genomes from the dataset of Lerat et al. that was also used 
in previous work and which is to some degree considered 
as a standard reference dataset in comparative genomics 
[13-15]. The suggested phylogeny has been confirmed in 
later studies e.g. Williams et al. [16]. The genomic data 
including gene annotations have been obtained from 
NCBI under the accession numbers listed in Table 1. 

All genomes comprise a single, circular chromosome. 
In support of simplified code but at the expense of 
accuracy, our implemented algorithms do not allow a 
chromosome to be circular, even though this is per- 
mitted by our presented model. However, the maximal 
error made by this inaccuracy in comparing two gen- 
omes is at most one adjacency, which is negligible in 
our analysis. The genomes were linearized in the order 
inherent to the NCBI data, and telomeres were added at 
the beginning and at the end of the resulting chromoso- 
mal sequences. 



Pairwise normalized similarities were obtained using 
the relative reciprocal BLAST score (RRBS) [17]. Genes 
were compared on the basis of their encoding protein 
sequence using BLASTP with an e-value threshold of 
0.1, disabled query sequence filtering, and disabled com- 
position-based score adjustments. All computations 
were performed on a computer system with 32 gigabytes 
of main memory. 

Exact Algorithm vs Heuristic 

Using the CPLEX solver we ran the 0-1 linear programs 
obtained by FFAdj-Int for 66 pairs of genomes with vary- 
ing values of a (a e {0.001, 0.3, 0.5, 0.8, 1}). A detailed 
table of results is enclosed in Additional file 1. In some 
cases, in particular larger and close genomes, we were not 
able to obtain results due to the lack of sufficient memory; 
We obtained results for 43 (resp. 63, 61, 54 and 48) pairs 
of genomes for a = 0.001 (resp. 0.3, 0.5, 0.8 and 1). For 40 
pairs of genomes we could solve the 0-1 linear program 
for all values of a. These results are summarized in Figure 
3, showing mean values of J^«(A^), adj(M)> and edg(M) 
plotted as a function of a. The same plot also includes 
results of our heuristic showing similar trends. Both, adj 
{M), and edg{M) show little change while varying a. This 
indicates that the set of high-scoring adjacencies and high- 
weight edges, that contribute the most, are largely shared 
among the matchings with different a. The abrupt drop in 
the mean value of edg{J\A) for a = 1 results from the fact 
that for this value the second term of Equation 4 drops 
out. Consequently, single pairs (i.e. those that do not share 
conserved adjacencies) of matched genes (Gi[i], G2[A']) are 
removed from the genomes during the resolution of the 
linear program. On the other side, because the heuristic 
iteratively constructs a matching until full saturation, the 
value for edg{J\/i) for a = 1 remains high. 
In our evaluation of FFAdj-MCS, we demonstrate that it 
represents a feasible heuristic for Problem FF- Adjacencies. 
In Table 2 the relative deviation of the heuristic results 



Table 1 Genomic dataset. 



Species/strain name 


Short name 


Accession No. 


Size (bp) 


#Genes 


Buchnera aphidicola APS 


BAPHI 


NC_002528 


640681 


564 


Escherichia coli K1 2 


ECOLI 


NC_000913 


4639675 


4320 


Haemophilus influenzae Rd 


HAEIN 


NC_000907 


1830138 


1657 


Pseudomonas aeruginosa PAOl 


PAERU 


NC_002516 


6264404 


5571 


Pasteurella multodda Pm70 


PMULT 


NC_002663 


2257487 


2012 


Salmonella typhlmurlum LT2 


SALTY 


NC_003197 


4857432 


4423 


Wigglesworthia glosslnidia brevipalpis 


WGLOS 


NC_004344 


697724 


611 


Xanthomonas axonopodls pv. citri 306 


XAXON 


NC_00391 9 


5175554 


4312 


Xanthomonas campestrls 


XCAMP 


NC_003902 


5076188 


4179 


Xylella fastidlosa 9a5c 


XFAST 


NC_002488 


2679306 


2766 


Yersinia pestis CO_92 


YPEST-C092 


NC_003143 


4653728 


3885 


Yersinia pestis KIMS PI 2 


YPEST-KIM 


NC_004088 


4600755 


4048 



The genomic dataset of our analysis comprises 12 7-proteobacteria from Lerat ef al. [14]. 
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Figure 3 Solving FF-Adjacencies for various values of a Mean values of J^aLM), odj{_M) and edg(_M) plotted as a function of a for 40 

genomes based on results of FFAdj-Int (solid lines) and FFAdj-iVICS (dashed lines). 



from the solutions of FFAdj-Int are listed. In the worst 
case, where a = 1, the heuristic deviates in the objective by 
less than 10%. Due to its greedy nature, in all cases but one 
the size of the matching is larger than in the optimal solu- 
tion. In order to evaluate the gain of the family-free analy- 
sis, we compare the results of FFAdj-Int against those of 
Adjacencies-Intermediate-iVIatching [13]. The linear pro- 
gram Adjacencies-Intermediate-Matching maximizes the 
number of adjacencies under the intermediate model 
between two genomes with gene family constraints. To 
compare the number of adjacencies (a common measure 
of these two programs) correctly, we must take into 
account two facts. First, the number of genes of the stu- 
died genomes differ. In [13], the authors used gene anno- 
tations and gene families that are reported in [15] whereas 
in our current study we employed gene annotations from 
NCBI. Nevertheless, the difference in number of genes is 
on average 0.02% per genome. Secondly, the genes for 
Adjacencies-Intermediate-Matching are unsigned, 
which artificially increases the number of adjacencies. We 
observed many more adjacencies in the results of FFAdj- 
Int and of FFAdj-MCS than in Adjacencies-Intermedi- 
ate-Matching. Furthermore, the matching produced by 

Table 2 Exact algorithm vs heuristic. 



Relative deviation RF distance 

#adj #edg #exact results exact heuristic 



0.001 


-2.67% 


2.83% 


-0.23% 


43 


2 


2 


0.3 


-347% 


0.90% 


0.31% 


63 


2 


2 


0.5 


-4.26% 


-1.03% 


0.84% 


61 


2 


2 


0.8 


-6.34% 


-1.71% 


1.14% 


54 


4 


2 


1 


-841% 


-2.39% 


1 7.7% 


48 


6 


2 



Comparison of FFAdj-Int and FFAdj-MCS. The left side of the table shows the 
relative deviation in the objective function J-ai.^A^' the number of 
adjacencies (#adj), and the size of the resulting matching (#edg), of heuristic 
results from the exact solutions under varying values of a. On the right, the 
RF distances between the true tree and trees based on exact and heuristic 
results are listed. 



both FFAdj-Int and FFAdj-MCS is on average larger 
than in Adjacencies-Intermediate-Matching. 

Evaluating phylogenies 

A good indicator for accuracy of a genome-based dis- 
tance measure is the quality of the phylogenetic tree 
based on its drawn distances. 

The distance measure that we used in this analysis 
resembles the breakpoint distance normalized by the size 
of the matching. For a given matching ^ of a size #edg 
{/A) and a given number of adjacencies #adj(A^), the 
normalized number of breakpoints is (#edg(A^) - #adj 
{A4) - l)/#edg(A^). Now, since the objective of Problem 
FF-Adjacencies does not maximize adjacencies but 
rather a linear combination of adj(J\4) and edg{M)> we 
define a distance measure based thereon: 



A(M) 



\{M)-adj{M) - 1 
edg(,M) 



In our evaluation we applied the well-known Neighbor 
Joining Method (NJ) [18] for inferring phylogenetic 
trees. Subsequently we compared these to the tree pro- 
posed by Lerat et al. [14] that we assume to represent 
the true phylogeny. Thereby we used the Robinson 
Foulds topological distance (RF distance) [19] to evaluate 
our inferred phylogenetic trees. The results are shown 
on the right side of Table 2. For the majority of all cases 
we were able to reconstruct the tree correctly up to a 
single internal edge, causing an RF distance of 2 to the 
original tree. This internal edge connects the two organ- 
isms Buchnera aphidicola (BAPHI) and Salmonella 
typhimurium (SALTY) with the rest of the tree (Figure 
4(a)). This branch is known to be particularly hard to 
reconstruct since the two organisms diverged far from 
each other, resulting in two long external edges in the 
tree. We also reconstructed the phylogeny based on 
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gene families using the measures that were proposed in 
[13] and obtained an RF distance of 2 to the true tree 
under the intermediate model and an RF distance of 4 
under the maximum matching model, featuring the 
same aberrancy. These results suggest that gene family 
information is not relevant in reconstructing the phylo- 
geny of Lerat et al.'s tree. Yet, the increasing deviation 
from the true tree for the results of FFAdj-Int when a 
tends to 1 indicates that for genome comparison, the 
maximization of adjacencies is not enough. The fact 
that the heuristic outperforms the exact algorithm for 
a = 1 in terms of RF distance to the true tree confirms 
the importance of maximizing edg{J\4) as well. We recall 
that FFAdj-MCS iterates until complete saturation is 
obtained, which increases edg{J\/(), even when a = 1. 

Often, one cannot judge the tree-additivity of the 
underlying distances by investigating the fully resolved 
Neighbor Joining tree. Thus, in Figure 4 we provide a 
NeighborNet [20] representation of some of our obtained 
phylogenies. In the plots the internal edges that are hard 
to reconstruct are directly exposed, showing network-like 
rather than tree-like structures, in particular for the tree 
obtained from [13]. To conduct these phylogenetic ana- 
lyses, we used the software packages PHYLIP [21] and 
SPLITSTREE [22]. 



Conclusions 

In this work, we introduced the concept of comparative 
genomics by direct analysis of gene similarities without 
prior assignment of gene families. To illustrate this 
approach, we resorted specifically to one problem of gene 
order comparison: Finding a matching that identifies 
similarities between two genomes by maximizing con- 
served adjacencies and similarities for each pair of genes 
simultaneously. This problem is NP-hard. We propose to 
resolve it by an exact algorithm (efficient for small gen- 
omes) and a good heuristic. In our experiments on 12 y- 
proteobacterial genomes, we observed that the omission 
of gene families allowed for an increase in the number of 
adjacencies as well as the size of the matching while the 
resulting distances gain higher precision in reconstruct- 
ing phylogenies. 

Future work. This study is a preliminary work in a new 
field of comparative genomics wherein the assignment of 
gene family is unnecessary. Many studies can be explored. 
With regard to the specific problem studied here, our 
exact algorithm can be improved by rules which reduce 
the required main memory. Moreover, we believe that a 
hybrid heuristic - starting a pre-matching using the itera- 
tive heuristic until the size of the MCS is less than a para- 
meter k, then finishing the matching with our exact 
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Figure 4 Phylogenies. (a) True phylogeny obtained from [14]. NeighborNet representations of phylogenies based on distances obtained by (b) 
Adjacencies-Intermediate-Matching [13], (c) running FFAdj-Int with a = 0.001, (d) a = 0.5, (e) a = 0.8, and (f) running FFAdj-MCS with a = 
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algorithm - can allow to find near-exact results for even 
larger genomes. On the other side, a deep study of the 
measure a can increase the quality of the comparison; 
comparing genes by sequence similarity is only one of 
many methods that can be applied. 

From a more general point of view, this study shows 
that it is conceivable to extend the direct analysis 
approach to other types of gene order studies such as 
the computation of DCJ distances or gene cluster 
prediction. 

Additional material 



Additional file 1: Measured adjacencies between 12 y-proteobacterial 
genomes. Values obtained from FFAdj-Int and FFAdj-MCS. #adj 

denotes the number of conserved adjacencies in the matching and 
#edg indicates the number of its edges. X indicates that the exact 
calculation did not terminate due to the lack of sufficient memory. 
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